Analysis and study on text representation to improve the accuracy of the Normalized Compression Distance
Abstract
The huge amount of information stored as text makes methods for processing text valuable. This thesis focuses on processing texts with compression distances. More specifically, it takes a small step towards understanding both the nature of texts and the nature of compression distances. Broadly speaking, it does so by exploring the effects that several distortion techniques have on one of the most successful members of the family of compression distances, the Normalized Compression Distance (NCD). The research is divided into three parts. The first part, which corresponds to Chapter 5, experimentally evaluates the impact of several word removal techniques on NCD-driven text clustering, with the aim of better understanding both the nature of compression distances and the nature of textual information. This is done by analyzing how the information contained in the documents, and the upper-bound estimate of their Kolmogorov complexity, evolve as words are removed. One of the main conclusions of this analysis is that clustering accuracy can be improved by applying a specific word removal technique: removing the most frequent words of the language while preserving the structure of the original text. The second part, which corresponds to Chapter 6, attempts to shed light on why this distortion technique can improve NCD-driven text clustering. The experimental results show that preserving both the structure of the original text and the structure of the remaining words has some relevance to the clustering behavior.
The third part, which corresponds to Chapter 7, applies the same distortion technique to NCD-driven document search. Applying compression distances to document search is not trivial, because they do not usually perform well when the compared objects have very different sizes. This part uses an NCD-based document search engine that addresses this drawback through passage retrieval. The results show that search accuracy can be improved by applying the distortion technique. In summary, one of the distortion techniques explored in the thesis has been found to be beneficial in both NCD-based document clustering and NCD-based document search.
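As a rough illustration of the two ingredients discussed in the abstract, the sketch below computes the NCD with its standard definition, NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C is the compressed size, and applies a word removal step that keeps the text layout. Several choices here are assumptions rather than details taken from the abstract: `zlib` stands in for whatever compressor the thesis uses, replacing each removed word with asterisks of the same length is only one possible way to preserve the text structure, and the frequent-word list passed in is hypothetical.

```python
import re
import zlib
from typing import Iterable


def compressed_size(data: bytes) -> int:
    """Upper-bound estimate of Kolmogorov complexity: size after compression.

    zlib is an assumption; any real-world compressor could be substituted.
    """
    return len(zlib.compress(data, 9))


def ncd(x: str, y: str) -> float:
    """Normalized Compression Distance between two texts."""
    bx, by = x.encode("utf-8"), y.encode("utf-8")
    cx, cy = compressed_size(bx), compressed_size(by)
    cxy = compressed_size(bx + by)  # compressed size of the concatenation
    return (cxy - min(cx, cy)) / max(cx, cy)


def mask_frequent_words(text: str, frequent: Iterable[str], fill: str = "*") -> str:
    """Replace each frequent word with filler characters of the same length.

    Keeping word lengths and positions intact is an assumed way of
    "preserving the text structure"; the word list is supplied by the caller.
    """
    frequent_set = {w.lower() for w in frequent}

    def replace(match: "re.Match[str]") -> str:
        word = match.group(0)
        return fill * len(word) if word.lower() in frequent_set else word

    return re.sub(r"[A-Za-z]+", replace, text)


if __name__ == "__main__":
    stop_words = ["the", "on", "of", "and"]  # hypothetical frequent-word list
    doc_a = "the cat sat on the mat and looked at the door of the house"
    doc_b = "the dog lay on the rug and stared at the gate of the garden"
    print(mask_frequent_words(doc_a, stop_words))
    print(ncd(mask_frequent_words(doc_a, stop_words),
              mask_frequent_words(doc_b, stop_words)))
```

A clustering or search pipeline along these lines would apply the masking to every document first and then build the pairwise NCD matrix from the masked texts.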
Journal: AI Commun.
Volume: 25
Publication date: 2012